Welcome!

Who are we?

Jonathan Keane

  • Engineering leader at Posit, PBC.

  • 15+ years building data tools in R and Python for scientific computing.

  • PMC member and maintainer of Apache Arrow; author of dittodb

  • Experienced with large-scale data analysis, modeling, and enterprise tools

Tyson Barrett

  • Applied statistician at Highmark Health and Utah State University

  • 15+ years of R programming and package development experience

  • Maintainer of data.table and 3 other R packages

  • Consultant on NSF grant supporting data.table infrastructure

  • Works with large datasets (millions of rows, hundreds of columns)

Kelly Bodwin

  • Associate Professor of Statistics and Data Science at Cal Poly

  • Co-author of R packages flair and tidyclust

  • Consultant on NSF grant supporting data.table infrastructure

  • Research experience with high-volume, in-memory data

Setup

Install some packages

Installs

(Also available as “installs.R” in the workshop materials repository.)

install.packages("pak")

pak::pak(c(
  "data.table",
  "arrow",
  "duckdb",
  "readr",
  "here",
  "glue",
  "tidyr",
  "dplyr",
  "dtplyr",
  "duckplyr",
  "tictoc",
  "microbenchmark",
  "bench",
  "lobstr",
  "nanoparquet"
))
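If you want to confirm the installation worked, an optional quick check is to ask R whether the core packages can be loaded. This sketch only assumes the package names from the pak::pak() call above:

```r
# Optional sanity check: should return TRUE for every package that
# installed correctly
pkgs <- c("data.table", "arrow", "duckdb", "dplyr", "dtplyr", "duckplyr")
vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
```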

Get the Data

Easiest, quickest option

Download a subset of the data

  • This subset includes only the person-level data for years 2005, 2018, and 2021, and only for the states Alaska, Alabama, Arkansas, Arizona, California, Washington, Wisconsin, West Virginia, and Wyoming.

  • Download it and unzip it into a directory called data in your working directory; you can then run the examples in the workshop.

Longer, but full dataset option

We also host a full version of the dataset in AWS S3.

Once you have set up your AWS account and CLI, download the data into a data directory to use:

aws s3 cp --recursive s3://scaling-arrow-pums/ ./data/

This is the full dataset, but it does require that you set up the AWS CLI and wait for the download to complete.
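Whichever option you chose, you can sanity-check the download from R. This is only a sketch, assuming the parquet files landed under ./data/ as described above:

```r
library(arrow)

# Open the files lazily as a multi-file dataset; nothing is read into
# memory until you collect() a query
pums <- open_dataset("data")

nrow(pums)    # row count, computed from parquet metadata
names(pums)   # column names from the schema
```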

The Public Use Microdata Sample (PUMS) dataset

About the data

  • Collected by the United States Census Bureau as part of the American Community Survey
  • Disclosure protection — noise is introduced so that specific people or households cannot be identified
  • Covers 2005–2022 using the 1-year estimates (except 2020, skipped due to COVID-19)
  • Split into person and household
    • columns: person: 230, household: 188
    • rows: person: 53M, household: 25M

A few example variables

  • Person
    • Language spoken at home (LANP)
    • Travel time to work (JWMNP)
  • Household
    • Access to the internet (ACCESS)
    • Monthly rent (RNTP)
  • Weights 😵‍💫
    • PWGTP (person weights) and WGTP (household weights)
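The weights deserve a quick illustration: each person-level row represents PWGTP people, so population estimates come from summing the weight rather than counting rows. A minimal sketch with made-up numbers (the column names follow the PUMS variables above):

```r
library(dplyr)

# Toy data (invented values): three survey rows, each standing in for
# PWGTP people
person <- data.frame(
  ST    = c("AK", "AK", "WY"),
  JWMNP = c(15, 30, 10),   # travel time to work, in minutes
  PWGTP = c(20, 35, 50)    # person weight
)

person |>
  group_by(ST) |>
  summarize(
    est_people  = sum(PWGTP),                  # estimated population
    est_mean_jw = weighted.mean(JWMNP, PWGTP)  # weighted mean travel time
  )
```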

Format of the data

  • Released and available as CSV files (~90GB)
  • Uses survey-style coding

For this workshop:

  • Recoded the dataset
  • Saved as parquet (~12GB) partitioned by year and state
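Because the parquet files are partitioned by year and state, engines like arrow can skip entire directories when you filter on those columns. A sketch, assuming a Hive-style layout such as data/person/year=2021/ST=CA/... (the exact directory and column names may differ in the workshop materials):

```r
library(arrow)
library(dplyr)

pums_person <- open_dataset("data/person")

# Filtering on the partition columns prunes files before any reading happens
pums_person |>
  filter(year == 2021, ST == "CA") |>
  select(JWMNP, PWGTP) |>
  collect()
```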

Can I analyze all of PUMS?

Most analyses of PUMS data start by subsetting the data: by state (or an even smaller geography), by year, and often both.

But with the tools we learn about in this workshop, we actually can analyze the whole dataset.
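For example, a grouped summary over every year and state can stay out of memory until the (small) result is collected. A sketch, assuming the parquet files live under data/person with year and state (ST) partition columns:

```r
library(arrow)
library(dplyr)

# Estimated population by year and state across the entire dataset;
# arrow streams the parquet files, so only the small result hits memory
open_dataset("data/person") |>
  group_by(year, ST) |>
  summarize(est_people = sum(PWGTP)) |>
  collect()
```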

Caveat

Though we have not purposely altered this data, it should not be relied on as a perfect, or even necessarily accurate, representation of the official PUMS dataset.

Goals

Goals of this Workshop

  1. Help you navigate when you need a speed-up trick and which tools will help you.

  2. Get you off the ground using data.table for faster operations on large data fully in R.

  3. Show you how to set up a duckdb database and use arrow and duckplyr to partition your analysis.

  4. Give you a unified workflow that combines these tools.

Where are you from, what do you work on, and how do you hope this workshop will be useful to you?